Safety Culture

A weak safety culture makes it extremely difficult to create safe 
systems. 

Consequences: 

A poor safety culture dramatically elevates the risk of creating an 
unsafe product. If an organization cuts corners on safety, one 
should reasonably expect the result to be an unsafe outcome. 

Accepted Practices: 

• Establish a positive safety culture in which all stakeholders 
put safety first, rigorous adherence to process is expected, 
and all developers are incentivized to report and correct 
both process and product problems. 

Discussion: 

A “safety culture” is the set of attitudes and beliefs employees
have toward attaining safety. Key aspects of such a culture include a
willingness to tell management that there are safety problems, 
and an insistence that all processes relevant to safety be followed 
rigorously. 

Part of establishing a healthy safety culture in an organization is 
a commitment to improving processes and products over time. 
For example, when new practices become accepted in an industry 
(for example, the introduction of a new version of the MISRA C 
coding style, or the introduction of a new safety standard such as 
ISO 26262), the organization should evaluate and at least 
selectively adopt those practices while formally recording the 
rationale for excluding and/or slow-rolling the adoption of new 
practices. (In general, one expects substantially all new accepted 
practices in an industry to be adopted over time by a company, 
and it is simply a matter of how aggressively this is done and in 
what order.) 

Ideally, organizations should identify practices that will improve 
safety proactively instead of reactively. But regardless, it is 
unacceptable for an organization building safety critical systems
to ignore new safety-relevant accepted practices with an excuse 
such as “that way was good enough before, so there is no reason 
to improve” - especially in the absence of a compelling proof that 
the old practice really was “good enough.” 

Another aspect of a healthy safety culture is aggressively 
pursuing every potential safety problem to root cause resolution. 
In a safety-critical system there is no such thing as a one-off 
failure. If a system is observed to behave incorrectly, then that 
behavior must be presumed to be something that will happen 
again (probably frequently) on a large deployed fleet. It is, 
however, acceptable to log faults in a hazard log and then 
prioritize their resolution based on risk analysis such as using a 
risk table (Koopman 2010, ch. 28).
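
The prioritization step can be sketched as follows. This is a minimal illustration, not the risk table from the cited chapter: the severity and likelihood scales, the multiplication rule, and the hazard entries are all hypothetical stand-ins for whatever a project's own safety plan defines.

```python
from dataclasses import dataclass

# Hypothetical ordinal scales; a real project takes these from its safety plan.
SEVERITY = {"minor": 1, "major": 2, "critical": 3, "catastrophic": 4}
LIKELIHOOD = {"improbable": 1, "remote": 2, "occasional": 3, "frequent": 4}


@dataclass(frozen=True)
class HazardLogEntry:
    ident: str
    description: str
    severity: str
    likelihood: str


def risk_rank(entry):
    """Assumed ranking rule: severity score times likelihood score."""
    return SEVERITY[entry.severity] * LIKELIHOOD[entry.likelihood]


def prioritize(log):
    """Order entries highest risk first; ties go to the more severe hazard."""
    return sorted(
        log,
        key=lambda e: (-risk_rank(e), -SEVERITY[e.severity], e.ident),
    )


hazard_log = [
    HazardLogEntry("H-1", "display flickers at startup", "minor", "frequent"),
    HazardLogEntry("H-2", "throttle command stuck high", "catastrophic", "remote"),
    HazardLogEntry("H-3", "diagnostic code lost on reset", "major", "improbable"),
]
ordered = [entry.ident for entry in prioritize(hazard_log)]
```

Every observed fault stays in the log; the ranking only decides the order of resolution, so a low-ranked entry is deferred rather than dismissed.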

Along these lines, blaming a person for a design defect is usually 
not an acceptable root cause. Since people (developers and 
system operators alike) make mistakes, saying something like 
“programmer X made a mistake, so we fired him and now the 
problem is fixed” is simply scapegoating. The new replacement 
programmer is similarly sure to make mistakes. Rather, if a bug 
makes it through a supposedly rigorous process, the fact that the 
process didn’t prevent, detect, and catch the bug is what is 
broken (for example, perhaps design reviews need to be modified 
to specifically look for the type of defect that escaped into the 
field). Similarly, it is all too easy to scapegoat operators when the 
real problem is a poor design or even when the real problem is a 
defective product. In short, blaming a person should be the last 
alternative when all other problems have been conclusively ruled 
out - not the first alternative to avoid fixing the problem with a 
broken process or broken safety culture. 

Believing that certain classes of defects are impossible to the 
degree that there is no point even looking for them is a sure sign 
of a defective safety culture. A common symptom is insisting that
software defects cannot possibly be responsible for safety problems
and instead blaming problems on human operators (or claiming that
repeated problems simply didn’t happen). See, for example, the
Therac-25 radiation accidents. No software is defect free (although
good software is nearly defect free to begin with, and is improved as
soon as new hazards are identified). No system is perfectly safe under all
possible operating conditions. An organization with a mature 
safety culture recognizes this and responds to an incident or 
accident in a manner that finds out what really happened (with 
no preconceptions as to whether it might be a software fault or 
not) so it can be truly fixed. It is important to note that both 
incidents and accidents must be addressed. A “near miss” must 
be sufficient to provoke corrective action. Waiting for people to 
die (or dozens of people to die) after multiple incidents have 
occurred and been ignored is unacceptable (for an example of 
this, consider the continual O-ring problems that preceded the 
Challenger space shuttle accident). 

The creation of safe software requires adherence to a defined
process with minimal deviation, and the only practical way to 
ensure this is by having a robust Software Quality Assurance 
(SQA) function. This is not the same as thorough testing, nor is it 
the same as manufacturing quality. Rather than being based on 
testing the product, SQA is based on defining and auditing how 
well the development process (and other aspects of ensuring 
system safety) have been followed. No matter how conscientious 
the workers, independent checks, balances, and quantifiable 
auditing results are required to ensure that the process is really 
being followed, and is being followed in a way that is producing 
the desired results. It is also necessary to make sure the SQA 
function itself is healthy and operational. 

Selected Sources: 

Making the transition from creating ordinary software to safety 
critical software is well known to require a cultural shift that 
typically involves a change from an all-testing approach to 
quality to one that has a balance of testing and process 
management. Achieving this state is typically referred to as 
having a “safety culture” and is a necessary step in achieving safety.
(Storey 1996, p. 107) Without a safety culture it is extremely 
difficult, if not impossible, to create safe software. The concept of 
a “safety culture” is borrowed from other, non-software fields, 
such as nuclear power safety and occupational safety. 

MISRA Software Guidelines Section 3.1.4 Assessment 
recommends an independent assessor to ensure that required 
practices are being followed (i.e., an SQA function). 

MISRA provides a section on “human error management” that 
includes: “it is recommended that a fear free but responsible 
culture is engendered for the reporting of issues and errors” 
(MISRA Software Guidelines p. 58) and “It is virtually impossible 
to prevent human errors from occurring, therefore provision 
should be made in the development process for effective error 
detection and correction; for example, reviews by individuals 
other than the authors.” 

Go Beyond System Functional Testing To Ensure
Safety 

Testing alone is insufficient to ensure safety in critical systems. 
Other technical approaches and software development process 
management approaches must also be used to assure sufficient 
software integrity. 

Consequences: 

Relying upon just system functional testing to achieve safety can 
be expected to eventually lead to an unsafe situation in a widely 
released product. Even if system functional testing is completely 
representative of situations that will happen in practice, such 
testing normally won’t be long enough to see all of the infrequent 
events that will occur with a much larger fleet of vehicles 
deployed for a much longer period of time. 
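
The gap between test exposure and fleet exposure can be made concrete with a little arithmetic. The sketch below assumes a constant-rate (Poisson) model, and the event rate, test program size, and fleet size are hypothetical numbers chosen only for illustration.

```python
import math


def expected_events(rate_per_hour, units, hours_per_unit):
    """Expected number of occurrences over the total exposure in unit-hours."""
    return rate_per_hour * units * hours_per_unit


def prob_at_least_one(rate_per_hour, units, hours_per_unit):
    """Chance of seeing the event at least once under a constant-rate model."""
    return 1.0 - math.exp(-expected_events(rate_per_hour, units, hours_per_unit))


# Hypothetical numbers, not taken from any real program.
RATE = 1e-7  # one triggering event per 10 million operating hours
TEST = dict(units=100, hours_per_unit=1_000)         # 100 test vehicles, 1,000 h each
FLEET = dict(units=1_000_000, hours_per_unit=4_000)  # 1 million vehicles, 4,000 h each

test_events = expected_events(RATE, **TEST)
fleet_events = expected_events(RATE, **FLEET)
```

With these assumed numbers the test program expects about 0.01 occurrences, so it will almost certainly never see the event, while the deployed fleet expects about 400.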

Accepted Practices: 


• Specifically identify and follow a process to design in safety
rather than attempting to test it in after the product has 
already been built. The MISRA Guidelines describe an 
example of an automotive-specific process. 

• Include defined activities beyond hiring smart designers 
and performing extensive functional testing. While details 
might vary depending upon the project, as an example, an 
acceptable set of practices for critical software by the late 
1990s would have included the following (assuming that 
MISRA Safety Integrity Level 3 were an appropriate 
categorization of the functions): precisely written 
functional specifications, use of a restricted language subset 
(e.g., MISRA C), a way of ensuring compilers produced 
correct code, configuration management, change 
management, automated build processes, automated 
configuration audits, unit testing to a defined level of 
coverage, stress testing, static analysis, a written safety 
case, deadlock analysis, justification/demonstration of test 
coverage, safety training of personnel, and availability of 
written documentation for assessment of safety 
(auditability of the process). (The required level of care 
today is, if anything, even more rigorous for such systems.) 

Discussion:

There is a saying about quality: “You can’t test in quality; you 
have to design it in from the start.” It is well known that the same 
is true of safety. 

Assuring safety requires more than just using capable designers 
and performing extensive testing (although those two factors are 
important). Even the best designers - like all humans - are 
imperfect, and even the most extensive system-level functional 
testing cannot hope to find everything that can go wrong in a 
large deployed fleet of automobiles. It should be apparent
that everyone can make a mistake, even careful designers. But
beyond that, system level functional testing (e.g., driving a car 
around in a variety of circumstances) cannot be expected to find 
all the defects in software, because there are just too many 
situations that can occur to experience them all in testing. This is 
especially true if a combination of events that causes a software 
failure just happens to be one that the testers didn’t think of 
putting into the test plan. (Test plans have bugs and gaps too.) 
Therefore, it has long been recognized that creating safe software 
requires more than just trying hard to get the design right and 
trying really hard to test well. 

Accepted practices require a holistic approach to safety, 
including executing a well-defined process, having a written plan 
to achieve safety, using techniques to ensure safety such as
fault tree analysis, and auditing the process to ensure all required 
steps are being performed. 

An accepted way of ensuring that safety has been considered 
appropriately is to have a written document that argues why a 
system is safe (sometimes called a safety case or safety 
argument). The safety case should give quantitative arguments as 
to why safety is inherent in the system. An argument that says 
“we tested for X hours” would be insufficient - unless it also said 
“and that covered 99.999% of all anticipated operating scenarios 
as well as thoroughly exercising every line of code” or some other 
type of argument that testing was thorough. After all, running a 
car in circles around a track is not the same level of testing as a 
cross-country drive over mountains. Or one that goes to Alaska in 
the winter and Death Valley in the summer. Or one that does so 
with 1000 cars to catch situations in which things inside one of 
those many cars just happen to line up in just the wrong way to 
cause a system failure. But even with the significant level of 
testing done by automotive companies, the safety case must also 
include things such as the level of peer reviews conducted, 
whether fault tree analysis revealed single points of failure, and 
so on. In other words, it’s inadequate to say “we tried really hard” 
or “we are really smart” or “we spent a whole lot of time testing.” 
It is essential to also justify that broad coverage was achieved 
using a variety of relevant techniques. 

Selected Sources: 

Beatty, in a paper aimed at educating embedded system
practitioners, explains that code inspections and testing aren’t 
sufficient to detect many common types of errors in complex 
embedded systems (Beatty 2003, pg. 36). He identifies five areas 
that require special attention: stack overflows, race conditions, 
deadlocks, timing problems, and reentrancy conditions. He states 
that “All of these issues are prevalent in systems that employ 
multitasking real-time designs.” 

Lists of techniques that could be applied to ensure safety beyond 
just testing have been well known for many years, with a 
relatively comprehensive example being IEC 61508 Part 7. 

Even if you could test everything (which you can’t), dealing with 
low-probability faults that can be expected to affect a huge 
deployed fleet of automobiles just takes too long. “It is impossible 
to gain confidence about a system reliability of 100,000 years by 
testing,” (written in reference specifically to drive-by-wire 
automobiles and their requirement for a mean-time-to-failure of 
1 billion hours) (Kopetz 2004, p. 32, emphasis per original) 
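
As a quick check of the scale involved, the conversion from the quoted 1 billion hour mean-time-to-failure to years can be worked out directly. The only assumption here is continuous operation with an average year of 365.25 days:

```latex
\[
\frac{10^{9}\ \text{hours}}{24 \times 365.25\ \text{hours/year}}
  = \frac{10^{9}}{8766}\ \text{years}
  \approx 1.14 \times 10^{5}\ \text{years}
\]
```

That is roughly 114,000 years of operation for a single unit, which is consistent with the order of magnitude in the quotation.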

Butler and Finelli wrote the classical academic reference on this
point, stating that attaining software needed for safety critical
applications will “inevitably lead to a need for testing beyond
what is practical” because the testing time must be longer than
the mean time between failures implied by the acceptable
catastrophic software failure rate. (Butler 1993, p. 3, paper
entitled “The infeasibility of quantifying the reliability of
life-critical real-time software.”)

Knutson gives an overview of software safety practices, and 
makes it clear that testing isn’t enough to create a safe system: 
“Even if we are wary of these dangerous assumptions, we still 
have to recognize the limitations inherent in testing as a means 
of bringing quality to a system. First of all, testing cannot prove 
correctness. In other words, testing can show the existence of a 
defect, but not the absence of faults. The only way to prove 
correctness via testing would be to hit all possible states, which 
as we’ve stated previously, is fundamentally intractable.” 
(Knutson 2000, pg. 34). Knutson suggests peer reviews as a 
technique beyond testing that will help. 

NASA says that “You can’t test everything. Exhaustive testing 
cannot be done except for the most trivial of systems.” (NASA 
2004, p. 77). 

Kendall presents a case study for an electronic throttle control 
(with mechanical fail-safes) using a two-CPU approach (a “sub 
Processor” and a “Main Processor”). The automotive supplier 
elected to follow the IEC 1508 draft standard (a draft of the IEC 
61508 standard), also borrowing elements from the MISRA 
software guidelines. Steps that were performed include: 
preliminary hazard analysis with mapping to MISRA SILs, review 
of standards and procedures to ensure they were up to date with 
accepted practices; on-site audits of development processes; 
FMEA by an independent agency; FTA by an independent
agency; Markov modeling (a technique for analyzing failure 
probabilities); independent documentation review; mathematical 
proofs of correctness; and safety validation testing. (Kendall 
1996) Important points from this paper relevant to this case 
include: “it is well accepted that software cannot be shown to be 
suitable for [its] intended use by testing alone” (id. pg. 6);
“Software robustness must be demonstrated by ensuring the 
process used to develop it is appropriate, and that this process is 
rigorously followed.” (id., pg. 6); “safety validation must consider 
the effect of the vehicle under as many failure conditions as is 
possible to generate.” (id., p. 7). 

Roger Rivett from Rover Group wrote a paper in 1997 based on a 
collaborative government-sponsored research effort that 
specifically addresses how automotive manufacturers should 
proceed to ensure the safety of vehicles. He makes an important 
point that rigorous use of good software practice is required in 
addition to testing (Rivett 1997, pg. 3). He has four specific 
conclusions for achieving a level of “good practice” for safety: use 
a quality management system, use a safety integrity level 
approach; be compliant with a sector standard (e.g., MISRA 
Software Guidelines), and use a third party assessment to ensure 
that high-integrity levels have been achieved. (Rivett 1997, pg. 10).

MISRA Development Guidelines, section 3.6.1, provides a set of 
points that make it clear that testing is necessary, but not 
sufficient, to establish safety (MISRA Guidelines, pg. 49): 

3.6 Testing

3.6.1 General

3.6.1.1 The limitations of testing software based systems should be [...]
number of possible paths through a complete program means [...]
regime only a proportion of the paths will be executed.

3.6.1.2 Testing forms one part of the overall verification and [...]
Figure 11). It should be remembered that quality can never be [...]

3.6.1.3 The purpose of testing is to discover errors, not to prove correc[...]

MISRA Testing Guidance (MISRA Software Guidelines, p. 49)

This last point of the MISRA Guidelines is key - testing can 
discover if something is unsafe, but testing alone cannot prove 
that a system is safe. 

"Testing on its own is not adequate for assessing safety-related 
software." (MISRA report 2 pg. iv) In particular, system-level 
testing (such as at the vehicle level), cannot hope to uncover all 
the possible faults or exceptional situations that can result in
mishaps. 

Fail-Safe Mechanisms Must Be Tested

Some systems base their safety arguments on the presence of 
“fail-safe” behaviors. In other words, if a failure occurs, the 
argument is that the system will respond in a safe way, such as by 
shutting down in a safe manner. If you have fail-safe 
mechanisms, you need to test them with a full range of faults 
within the intended fault model to make sure they work properly. 

Consequences: 

Failing to specifically test for mitigation of single points of failure 
means that there is no way to be sure that the mitigation really 
works, putting safety of the system into doubt. 

As an example, if a hardware watchdog timer is not turned on, it 
won’t reset the system, but there might be no way to tell whether 
the watchdog timer is on or not (or set to the wrong value, or 
otherwise used improperly) without specifically testing whether 
the watchdog works or not. Thus, you can’t take credit for having
a watchdog timer unless you have actually tested that it works for 
each fault that matters (or, if there are many such faults, argue 
that you have attained sufficient coverage with the tests that are 
run). 

Accepted Practices: 

• Each and every fail-safe mechanism and fault management 
mechanism must be tested, preferably on a fully integrated 
system. Such tests may be difficult to perform in normal 
functional testing and may require intentional fault 
injection from the outside of the system (e.g., breaking a 
sensor) or fault injection at test points inside the system 
(e.g., intentionally killing a task using special test support 
infrastructure). 

Discussion: 

Fault injection is the process of intentionally inducing a 
hardware or software fault and determining its effect upon the 
system. 

Fault management mechanisms, and especially fail-safe 
mechanisms, are often the key points upon which an argument as 
to the safety of a system rests. As an example, a safety case based 
on a watchdog timer detecting task failures requires that the 
watchdog timer actually work. While it is of course important to 
make sure that the system has been designed properly, there is 
no substitute for testing whether the watchdog timer is actually 
turned on during system test. (To revisit a point on system 
testing made elsewhere in my postings - system testing is not 
sufficient to ensure safety, but thorough system testing is 
certainly an important thing to do.) It is similarly important to 
specifically test every fault mode that must be handled by the 
system to ensure fault handling is done correctly. 

Some examples of fault tests that should be performed include: 
killing each task independently to ensure that the death of any 
task is caught by the watchdog (and, by extension, cannot cause 
an unsafe system state); overloading the system to ensure that it 
behaves safely in an unanticipated CPU overload situation; 
checking that diagnostic fail-safes detect the faults they are 
supposed to and react by putting the system into a safe state; 
disabling sensors; disabling actuators; and others. 
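
The first of those tests, killing each task in turn, can be sketched as a small simulation. This is only an illustration under stated assumptions: the Watchdog class, the check-in scheme, and the task names are hypothetical, and a real test would run against the integrated system and its actual hardware watchdog.

```python
class Watchdog:
    """Simulated watchdog that trips unless every task checked in this period."""

    def __init__(self, task_names, enabled=True):
        self.enabled = enabled
        self.task_names = list(task_names)
        self.alive = set()
        self.reset_count = 0

    def check_in(self, name):
        self.alive.add(name)

    def end_of_period(self):
        """Return True (and count a reset) if an enabled watchdog sees a missing task."""
        tripped = self.enabled and self.alive != set(self.task_names)
        if tripped:
            self.reset_count += 1
        self.alive.clear()
        return tripped


def run_periods(watchdog, killed=(), periods=3):
    """Run the task set for a few periods; killed tasks never check in."""
    for _ in range(periods):
        for name in watchdog.task_names:
            if name not in killed:
                watchdog.check_in(name)
        if watchdog.end_of_period():
            return True
    return False


def fault_injection_campaign(task_names, enabled=True):
    """Kill each task in turn; return the tasks whose death went undetected."""
    undetected = []
    for victim in task_names:
        watchdog = Watchdog(task_names, enabled=enabled)
        if not run_periods(watchdog, killed={victim}):
            undetected.append(victim)
    return undetected


TASKS = ["throttle", "brake_monitor", "diagnostics"]  # hypothetical task names
```

Calling fault_injection_campaign(TASKS) returns an empty list, while the same campaign with enabled=False reports every task as undetected. A fault-free run looks identical in both configurations, which is why a watchdog that was never turned on goes unnoticed until a fault is injected.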

Another perspective on this topic is that ensuring safety usually 
involves arguing that all single points of failure have been 
mitigated to make the system safe. To demonstrate that the 
reasoning is accurate, a system must have corresponding failures 
injected to make sure that the mitigation approaches actually 
work, since the system’s safety case rests upon that assumption. 
This might include intentionally corrupting bits in memory, 
corrupting computations that take place, corrupting stack 
contents, and so on. 
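
Memory corruption tests of this kind can be sketched in a few lines. The example assumes, purely for illustration, that the mitigation under test is a CRC-32 appended to a data block; the block layout and function names are hypothetical. The campaign flips every single bit in turn and records any flip the check fails to catch.

```python
import zlib


def protect(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload (assumed mitigation)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(block: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare with the stored value."""
    payload, stored = block[:-4], block[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of the block with exactly one bit inverted."""
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def undetected_single_bit_flips(block: bytes) -> list:
    """Inject every possible single-bit fault; list the ones that slip through."""
    return [i for i in range(len(block) * 8) if is_intact(flip_bit(block, i))]


block = protect(b"throttle=12")  # hypothetical protected variable
escapes = undetected_single_bit_flips(block)
```

An empty escapes list is evidence for this one fault class only. Multi-bit corruption, corrupted computations, and stack damage each need their own injection campaign.
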

It is important to note that ordinary system functional testing
tends to do a poor job at exercising fault mitigation mechanisms. 
As an example, if a particular task is never supposed to die, and 
testing has been thorough, then that task won’t die during 
normal functional testing (if it did, the system would be 
defective!). The point of detecting task death is to handle 
situations you missed in testing. But that means the mechanism 
to detect task death and perform a restart hasn’t been tested by 
normal system-level functional tests. Therefore, testing fail-safe 
mechanisms requires special techniques that intentionally 
introduce faults into the system to activate those fail-safes. 

Selected Sources: 

Safety critical systems are deemed safe only if they can withstand 
the occurrence of any single point fault. But, there is no way to 
know if they will really do that unless testing includes actually 
injecting representative single point faults to see if the system 
will respond in a safe manner. You can’t know if a system is 
safe if you don’t actually test its safety capabilities, 
and doing so requires fault injection. For example, if you 
expect a watchdog to detect failed tasks, you need to kill each and 
every task in turn to see if the watchdog really works. Arlat 
correctly states that “physical fault injection will always be 
needed to test the actual implementation of a fault tolerant 
system” (Arlat 1990, pg. 180) 

The need to actually test fail-safe mechanisms to see if they really 
work should be readily apparent to any engineer. Pullum 
discusses this topic by suggesting the use of fault injection 
(intentionally causing faults as a testing technique) in the context 
of “verification of integration of fault and error processing 
mechanisms” for creating dependable systems (Pullum 2001, pg. 93).

“Fault injection is important to evaluating the dependability of 
computer systems.... It is particularly hard to recreate a failure 
scenario for a large complex system.” (Hsueh et al., 1997 pg. 75, 
speaking about the need for fault injection as part of testing a 
system). Mariani refers to the IEC 61508 safety standard and 
concludes that “fault-injection will be mandatory for soft error 
sensitivity verification” for safety critical systems (Mariani 2003, pg.
60). “A fault-tolerant computer system’s dependability must be 
validated to ensure that its redundancy has been correctly 
implemented and the system will provide the desired level of 
reliable service. Fault injection - the deliberate insertion of faults 
into an operational system to determine its response - offers an 
effective solution to this problem.” (Clark 1995, pg. 47). 

Fault injection must include all possible single-point faults, not 
just faults that can be conveniently injected via the pins or 
connectors of a component. Rimen et al. compared internal vs. 
external fault injection, and found that only 9%-12% of bit
flip faults that occur inside a microcontroller could be tested via 
external pin fault injection (Rimen et al. 1994, p. 76). In 1994,
Karlsson reported on the effectiveness of using a radioactive 
isotope to inject faults into a microcontroller (Karlsson 1994). 
Later fault injection work by Karlsson’s research group was 
performed on automotive brake-by-wire applications, sponsored 
by Volvo (Aidemark 2002), clearly demonstrating the 
applicability of fault injection as a relevant technique for safety 
critical automotive systems. And other similar work found 
defects in a safety critical automotive network protocol. (Ademaj 
2003) 

A test specifically on an engine control program using fault 
injection caused “permanently locking the engine’s throttle at full 
speed.” (Vinter 2001). 

There are numerous other scholarly works in this area. An early 
example is Bossen (1981). Some others include: Arlat et al. (1989),
Barton et al. (1990), Benso et al. (1999), Han (1995), and
Kanawati (1995). As a more recent example, Baumeister et al. 
performed fault injection on an automotive braking controller via 
irradiating it and measuring the errors, finding that unprotected 
SRAM and unprotected microcontroller paths were both 
sensitive to upsets (Baumeister 2012, pg. 5).

MISRA Software Guidelines take it for granted that fault 
management capabilities will be tested (e.g., MISRA Software 
Guidelines 34.8.3 pg. 44, MISRA Report 4 p. v). “Fault injection
test” is recommended by ISO 26262-6 (pg. 23) for software 
integration, noting that “This includes injection of arbitrary 
faults in order to test safety mechanisms (e.g., by corrupting 
software or hardware components).” 

By the late 1990s fault injection tools had become quite 
sophisticated, and were capable of injecting faults while a system 
was running at full speed even if source code was not available 
(e.g., Carreira 1998). 

An example of a testing approach along these lines is E-GAS
(E-GAS), which includes numerous tests based on auto
manufacturer experience to ensure that various faults will be 
handled safely. 

It is important to note that while mitigation techniques such as 
watchdog timers are a good practice if implemented properly, 
they are not sufficient to guarantee safety in the face of random 
errors. For example, Gunneflo presents experimental evidence 
indicating that watchdog effectiveness is less than perfect, and 
depends heavily on the particular software being run. Gunneflo 
recommends: “To accurately estimate coverage and latency for 
watch-dog mechanisms in a specific system, fault injection 
experiments must be carried out with the final implementation of 
the system using the real software.” (Gunneflo 1989, pg. 347). In 
other words, even if you have a watchdog timer, you need to 
perform fault injection to understand whether there are holes in 
your fault tolerance approach. 
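
Once such experiments have been run, the raw results need to be reduced to coverage and latency figures. The sketch below shows one simple way to do that; the result format (a detected flag plus a detection latency in milliseconds) and the sample values are hypothetical, and a real campaign would also report confidence bounds.

```python
from statistics import mean


def summarize_campaign(results):
    """Summarize fault injection results given as (detected, latency_ms) pairs.

    latency_ms is None for faults that were never detected.
    """
    if not results:
        raise ValueError("no faults were injected")
    latencies = [latency for detected, latency in results if detected]
    return {
        "injected": len(results),
        "coverage": len(latencies) / len(results),
        "mean_latency_ms": mean(latencies) if latencies else None,
        "max_latency_ms": max(latencies) if latencies else None,
    }


# Hypothetical outcomes of four injected faults.
sample = [(True, 10.0), (True, 30.0), (False, None), (True, 20.0)]
summary = summarize_campaign(sample)
```

For this sample the summary reports a coverage of 0.75 with a worst-case detection latency of 30.0 ms. The undetected fault is the one that matters most for the safety case.
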

Peer Reviews and Critical Software

Every line of critical embedded software should be peer reviewed 
via a process that includes a physical face-to-face meeting and 
that produces an auditable peer review report. 

Consequences: 

Failing to perform peer reviews can reasonably be expected to 
increase the defect rate in software for several reasons. All
real-world projects have limited time and resources, so by skipping or
skimping on peer reviews developers have missed an easy chance 
to eliminate defects. With inadequate reviews, developers are 
spread thin chasing down bugs found during testing. 
Additionally, peer reviews can find defects that are impractical to 
find in most types of testing, especially in cases of fault 
management or handling unexpected/infrequent operating 
scenarios. 

Accepted Practices: 

• Every line of code must be reviewed by at least one 
independent, technically skilled person. That review must 
include actually reading the entirety of the code rather than 
just looking at selected portions. 

• Peer reviews must be documented so that it is possible to 
audit the fact that they took place and the effectiveness of 
the reviews. At a minimum this includes recording the 
name of the reviewer(s), the code reviewed, the date of the 
review, and the number of defects found. If no auditable 
documentation of software quality is available for 
incorporated components (e.g., safety certification or peer 
review reports), then new peer reviews must be performed 
on that third-party code. 

• Acceptable safety critical system software processes 
normally require a formal meeting-based review rather 
than a remote review, e-mail review, or other casual 
checking mechanism. 

Discussion:

Peer reviews involve having an independent person - other than 
the author - look at source code and other design documents. 
The main purposes of the review are to ensure that code 
conforms to style guidelines and to find defects missed by the 
author of the code. Running a static analysis tool is not a 
substitute for a peer review, and neither is an in-person 
discussion that solely discusses the output of a static analysis 
tool. A proper peer review requires having an independent 
person (or, strongly preferably, a small group of independent
reviewers) read the code in its entirety to ensure quality. The
everyday analogy to a peer review is having someone else proofread
something you’ve written. It is nearly impossible to see all
our own mistakes whether we are writing software or writing 
English prose. 

It is well known that more formal reviews provide more efficient 
and effective results, with the gold standard being what is known 
as a Fagan Style Inspection (a “code inspection”) that involves a 
pre-review, a formal meeting with defined roles, a written review 
report, and follow up actions. Regardless of the type of review, 
accepted practice is to record the results of reviews and audit 
them to make sure every single line of code has been reviewed 
when written, and re-reviewed when a module has been 
modified. 
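
The audit described above can be sketched as a check of review records against modification dates. The record fields follow the minimum set listed under Accepted Practices, with an author field added so that independence can be checked. The module names, people, and dates are hypothetical, and a real audit would work from the configuration management system rather than hand-entered data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReviewRecord:
    module: str
    reviewers: tuple
    author: str
    review_date: date
    defects_found: int


def audit(last_modified, reviews):
    """Return {module: reason} for modules lacking an adequate review record."""
    findings = {}
    for module, modified in last_modified.items():
        independent = [
            r for r in reviews
            if r.module == module and r.reviewers and r.author not in r.reviewers
        ]
        if not independent:
            findings[module] = "no independent review on record"
        elif max(r.review_date for r in independent) < modified:
            findings[module] = "modified after last review"
    return findings


# Hypothetical project data.
last_modified = {
    "throttle.c": date(2020, 3, 1),
    "watchdog.c": date(2020, 3, 5),
    "diag.c": date(2020, 2, 1),
}
reviews = [
    ReviewRecord("throttle.c", ("alice",), "bob", date(2020, 3, 2), 4),
    ReviewRecord("watchdog.c", ("carol",), "dave", date(2020, 3, 1), 1),
    ReviewRecord("diag.c", ("erin",), "erin", date(2020, 2, 2), 0),
]
findings = audit(last_modified, reviews)
```

Here the audit flags watchdog.c as modified after its last review and diag.c as reviewed only by its own author, while throttle.c passes.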

Figure: General code inspection process.

Source: http://helmi.h0me.pages.at/idimt2000/idimt2000.html 

MISRA requires a “structured program review” for SIL 2 and
above. (MISRA Report 2 p. ix). MISRA specifically lists “Fagan 
Inspection” as a type of review (MISRA Software Guidelines p. 
12), and devotes two appendices of a report on verification and 
validation to “walkthroughs,” listing structured walkthroughs, 
code inspections, Fagan inspections, and peer reviews (MISRA 
Report 6 pp. 132-136). MISRA points out that walkthroughs 
(their general term for peer reviews) “are acknowledged to be an
effective process for identifying errors in programs - indeed they 
can be more effective than computer-based testing for certain 
types of error.” 

MISRA also points out that fixing a bug may make things worse 
instead of better, and says that code reviews and analysis should 
be used to validate bug fixes. (MISRA Report 5 p. 135) 

Peer reviews are somewhat labor intensive, and might
account for 10% of the effort on a project. However, it is common 
for good peer reviews to find 50% or more of the defects in a code 
base, and thus finding defects via peer review is much cheaper 
than finding them via testing. Ineffective reviews can be 
diagnosed by the fact that they find far fewer defects. Acceptable 
peer reviews normally find defects that would be missed by 
testing, especially in parts of the code that are difficult to test 
thoroughly (for example, exception and failure management 
code). 

Selected Sources: 

McConnell devotes Chapter 24 to a discussion of reviews and 
inspections (McConnell 1993). Boehm & Basili summarized best 
practices for reducing software defects, and included the 
following point relevant to peer reviews: “Peer reviews catch 60 
percent of the defects.” (Boehm 2001, pg. 137). 

Ganssle lists four steps that should be the first steps taken to 
improve software quality. They are: “1. Buy and use a version 
control system; 2. Institute a Firmware Standards Manual; 3. 
Start a program of Code Inspection; 4. Create a quiet 
environment conducive to thinking.” #3 is his term for peer 
reviews, indicating his recommendation for formal code 
inspections. He also says that he knows companies that have 
made all these changes to their software process in a single day. 
(Ganssle 2000, p. 13). (Ganssle’s #2 item is coding style, 
discussed in Section 8.6.) 

MISRA Software Guidelines list the following as techniques on a 
one-picture overview of the software lifecycle: “Walkthrough, 
Fagan Inspection, Code Inspection, Peer Review, Argument, etc.” 
(MISRA Guidelines 1994, pg. 20) indicating the importance of 
formal peer reviews in a safety critical software lifecycle. Integrity 
Level 2 (which is only somewhat safety critical) and higher 
integrity levels require a “structured program review” (pg. 29). 
That document also gives these rules: “3.5.2.2 Before dynamic 
testing begins the code should be reviewed in accordance with 
the software verification plan to ensure that it does conform to 
the design specification” (pg. 56) and “3.5.2.3 Code reviews 
and/or walkthroughs should be used to identify any 
inconsistencies with the specifications” (pg. 56) and “4.3.4.3 The 
communication of information regarding errors to design and 
development personnel should be as clear as possible. For 
example, errors found during reviews should be fully recorded at 
the point of detection.” 

MISRA C rule 116 states: “All libraries used in production code 
shall be written to comply with the provisions of this document, 
and shall have been subject to appropriate validation” (MISRA C 
pg. 55). Within the context of embedded systems, an operating 
system such as OSEK would be expected to count as a “library” in 
that it is code included in the system that is relied upon for 
safety, and thus should have been subject to appropriate 
validation, which would be expected to include peer reviews. If 
there is no evidence of peer review or safety certification, the 
system designer should perform peer reviews on the OS code 
(which is an excellent reason to use a safety certified OS!) 

Fagan-style inspections are a formal version of a “peer review,” 
which involves multiple software developers looking at software 
and other design artifacts to find defects. Fagan-style inspections 
originated at IBM (Fagan 1976). A later paper presented updated 
techniques, concluding that “inspections increase productivity 
and improve final program quality. Furthermore, improvements 
in process control and project management are enabled by 
inspections.” (Fagan 1986). It is widely recognized that Fagan- 
style inspections are a best practice, and that some sort of 
effective peer review technique is an accepted practice. 

Fagan-style Formal Inspections are recommended by the FAA 
(FAA 2000, p. J-23). IEC 61508-3 highly recommends 
performing some sort of design review on all software at all SILs, 
and recommends Fagan inspections at SIL4. (p. 91). 

Use of Static Analysis Tools 

Critical embedded software should use static checking tools with 
a defined and appropriate set of rules, and should have zero 
warnings from those tools. 

Consequences: 

While rigorous peer reviews can catch many defects, some 
misuses of language are easy for humans to miss but 
straightforward for a static checking tool to find. Failing to use a 
static checking tool exposes software to a needless risk of defects. 
Ignoring or accepting the presence of large numbers of warnings 
similarly exposes software to needless risk of defects. 

Accepted Practices: 

• Using a static checking tool that has been configured to 
automatically check as many coding guideline violations as 
practicable. For automotive applications, following all or 
almost all (with defined and justified exceptions) of the 
MISRA C coding standard rules is an accepted practice. 

• Ensuring that code checks “clean,” meaning that there are 
no static checking violations. 

• In rare instances in which a coding rule violation has been 
formally approved, use pragmas to formally document the 
deviation and direct the static checking tool not to issue a 
warning. 

Discussion: 

Static checking tools look for suspicious coding structures and 
data use within a piece of software. Traditionally, they look for 
things that are “warnings” instead of errors. The distinction is 
that an error prevents the compiler from being able to generate 
code that will run. In contrast, a warning is an instance in which 
code can be compiled, but in which there is a substantial 
likelihood that the code the compiler generates will not actually 
do what the designer wants it to do. Reasons for a warning might 
include ambiguities in the language standard (the code does 
something, but it’s unclear whether what it does is what the 
language standard meant), gaps in the language standard (the 
code does something arbitrary because the language standard 
does not standardize behavior for this case), and dangerous 
coding practices (the code does something that is probably a bad 
idea to attempt). In other words, warnings point out potential 
code defects. Static analysis capabilities vary depending upon the 
tool, but in general are all designed to help find instances of poor 
use of a programming language and violations of coding rules. 
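
As a small illustration of the "dangerous coding practices" category, here is a minimal sketch of a mixed signed/unsigned comparison, which many static checking tools flag. The function names and the sensor-limit scenario are hypothetical and chosen only for this example. The code compiles, but the first version does not do what its author most likely intended when the reading is negative.

```c
#include <stdbool.h>

/* Suspicious: 'reading' is converted to unsigned int before the comparison,
 * so a negative reading turns into a very large positive value. */
bool below_limit_suspicious(int reading, unsigned int limit)
{
    return reading < limit;   /* typical tool warning: signed/unsigned mismatch */
}

/* Same intent, written so that no implicit conversion changes the value. */
bool below_limit_explicit(int reading, unsigned int limit)
{
    if (reading < 0) {
        return true;          /* any negative reading is below an unsigned limit */
    }
    return (unsigned int)reading < limit;
}
```

With a reading of -1 and a limit of 10, the suspicious version reports that the reading is not below the limit, while the explicit version reports that it is. Nothing here is a compile error, which is exactly why a warning is the only hint you get.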

An analogous example to a static checking tool is the Microsoft 
Word grammar assistant. It tells you when it thinks a phrase is 
incorrect or awkward, even if all the words are spelled correctly. 
This is a loose analogy because creativity in expression is 
important for some writing. But safety critical computer code 
(and English-language writing describing the details of how such 
systems work) is better off being methodical, regular, and 
precise, rather than creatively expressed but ambiguous. 

Static checking tools are an important way of checking for coding 
style violations. They are particularly effective at finding 
language use that is ambiguous or dangerous. While not every 
instance of a static checking tool warning means that there is an 
actual software defect, each warning given means that there is 
the potential for a defect. Accepted practice for high quality 
software (especially safety critical software) is to eliminate all 
warnings so that the code checks “clean.” The reasons for this 
include the following. A warning may seem to be OK when 
examined, but might become a bug in the context of other 
changes made to the code later. A multitude of warnings that 
have been investigated and deemed acceptable may obscure the 
appearance of a new warning that indicates an actual bug. The 
reviewer may not understand some subtle language-dependent 
aspect of a warning, and thus think things are OK when they are 
actually not. 

Selected Sources: 

MISRA Guidelines require the use of “automatic static analysis” 
for SIL 3 automotive systems and above, which tend to be 
systems that can kill or severely injure at least one person if they 
fail (MISRA Guidelines, pg. 29). The guidelines also give this 
guidance: “3.5.2.6 Static analysis is effective in demonstrating 
that a program is well structured with respect to its control, data, 
and information flow. It can also assist in assessing its functional 
consistency with its specification.” 

McConnell says: “Heed your compiler's warnings. Many modern 
compilers tell you when you have different numeric types in the 
same expression. Pay attention! Every programmer has been 
asked at one time or another to help someone track down a pesky 
error, only to find that the compiler had warned about the error 
all along. Top programmers fix their code to eliminate all 
compiler warnings. It's easier to let the compiler do the work 
than to do it yourself.” (McConnell, pg. 237, emphasis added). 

Coding Style Guidelines and MISRA C 

Critical embedded software should follow a well-defined set of 
coding guidelines, enforced with comprehensive static checking 
tools, with essentially no deviations. MISRA C is an example of 
an accepted set of such coding guidelines. 

Consequences: 

Coding style guidelines exist to make it more difficult to make 
mistakes, and also to make it easier to detect when mistakes have 
been made. Failing to establish or follow formal, written coding 
guidelines makes it more difficult to understand code, leading to 
less effective code reviews and a reasonable expectation of 
increased levels of software defects. Failing to follow the 
language usage rules defined by a coding style guideline leads to 
using the language in a way that can normally be expected to 
result in poorly defined or incorrect software behaviors, which 
increases the risk of software defects. 

Accepted Practices: 


• All projects should follow a written coding style guideline 
document. 

• Coding guidelines should address formatting, commenting, 
name use, and other similar topics. 

• Coding guidelines should address good language usage 
practices to create understandable code and reduce the 
chance of introducing software defects. 

• Coding guidelines should specifically address which 
language features and usage patterns should be avoided as 
being error-prone, dangerous, or undefined by the 
language standard. 

• Coding guidelines should be followed with essentially no 
exception. Exceptions should require formal review with 
written approval and annotations in the source code. If 
guidelines are inappropriate, the guidelines should be 
changed. 


Discussion: 

Style in software can be considered analogous to style in writing. 
Compilers enforce some basic programming language 
construction rules that allow the source code to be compiled into 
executable software. Style, on the other hand, has more to do 
with how ideas are expressed within the constraints of the 
programming language used. Some style considerations have to 
do with variable naming conventions, indentation, and physical 
organization aspects of lines of code. Other style considerations 
have to do with the use of the programming language itself. Some 
constructs in a programming language are ambiguous or easily 
misunderstood. And some constructions in software, while 
correct in terms of language definition, are very likely to indicate 
a software defect. 

A classic example of a subtle defect in the C programming 
language is “if (x = y) { ... }”. The programmer almost always 
means to compare “x” and “y” for equality, but the C programming 
language is defined such that this code instead copies the value of 
“y” into “x” and then tests to see if the result of that copy 
operation was non-zero. The correct code would be 
“if (x == y) { ... }”, which adds a second “=” to make the 
operation a comparison instead of an assignment operation. Using 
“=” instead of “==” in conditionals is a common mistake when 
creating C programs, is easy to confuse visually, and is therefore 
prohibited by typical style guidelines even though the single “=” 
version of the code is a valid language construct. A loose analogy 
might be a prohibition against using multiple negatives in an 
English language sentence because it is too difficult to not not 
not not not not (sic) make a mistake with such a sentence even 
though the meaning is unambiguous if the sentence is carefully 
(and correctly) analyzed. 
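
To make the difference concrete, here is a minimal sketch of the two forms side by side. The function names are made up for illustration. Both functions compile, and many compilers only issue a warning (if anything) for the first one.

```c
#include <stdbool.h>

/* Defective: "=" copies y into x, then tests whether the copied value is
 * non-zero. The original value of x never matters. */
bool same_value_defective(int x, int y)
{
    if (x = y) {
        return true;
    }
    return false;
}

/* Intended: "==" compares x and y without modifying either one. */
bool same_value_correct(int x, int y)
{
    if (x == y) {
        return true;
    }
    return false;
}
```

Calling the defective version with 1 and 2 reports that the values match (because 2 is non-zero), and calling it with 0 and 0 reports that they do not. A test suite that happens to use only equal, non-zero values would not notice the difference.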

It is accepted practice to have a defined set of coding guidelines 
that cover all relevant aspects of programming language use. 

Such guidelines are typically tailored for each project, but once 
defined should be followed rigorously. Guidelines typically cover 
code formatting, commenting, use of names, language use 
conventions, and other relevant aspects. While guidelines can be 
tailored per project, there are nonetheless a number of generally 
accepted practices for reducing the chance of software defects 
(such as forbidding a single “=” in a conditional evaluation as just 
discussed). 

Of particular concern in safety critical software are language use 
rules to avoid ambiguity and hazardous language constructs. It is 
a generally accepted practice to outright ban hazardous and 
error-prone language structures to avoid the chance of defects, 
even if doing so makes software a bit less convenient to write and 
those structures would otherwise be a legal use of the language. 

In other words, an essential aspect of coding style for safety 
critical systems is outlawing code structures that are technically 
valid but are too dangerous or error-prone to use. The result is a 
written document that defines coding style in general, and 
language usage rules in particular. These rules must be applied 
rigorously and with essentially no exceptions. (The “no 
exceptions” part is feasible because it is acceptable to tailor the 
rules to the particular project. So it is not a matter of applying 
arbitrary rules and making exceptions, but rather picking rules 
that make sense for the situation and then rigorously sticking to 
them.) In short, every safety critical software project must have a 
coding style guide and must follow it rigorously to achieve 
acceptable levels of safety. 

It is often the case that it makes sense to adopt an existing set of 
language use rules rather than make up your own. The MISRA C 
set of coding rules was specifically created for safety critical 
automotive software, and is the most well known C programming 
language subset for safety critical software. A typical practice 
when writing safety critical C code is to start with MISRA C, 
create a defined set of which rules will be followed (usually this is 
almost all of them), and then follow those rules rigorously. 
Exceptions to adopted rules should be very few, granted only 
after a formal written review process, and documented in the 
code as to the type of exception and reason for granting it. 
Preferably, automated tools (widely available for MISRA C as 
discussed in Jones 2002, pg. 56) are used to enforce the rules in 
addition to a required peer review of code. 

It is accepted practice to adopt new coding style rules when 
better practices come into use, and apply those coding style rules 
to existing code when that code is being updated and 
incorporated into a new product. 

Selected Sources: 

The MISRA C guidelines (MISRA C 1998) are specifically 
designed for safety critical systems at SIL 2 and above. (MISRA 
Report 2 p. ix) They consist of a list of rules about coding 
practices to use and practices to avoid. They concentrate 
primarily on language use rather than code formatting. While 
MISRA C was originally developed for automotive applications, it 
was being set forth as a more general standard for adoption in 
other domains by 2002 (e.g., Jones 2002). Over time, MISRA C 
has transitioned beyond just automotive applications to 
mainstream use for high integrity software in other areas. A 
predecessor of MISRA C is the list of rules in the book Safer C 
(Hatton, 1995). 

More general coding style guidelines abound. It is easy to find a 
coding style guideline that can be adapted for the specifics of a 
project. Examples include chapter 18 of McConnell (1993). 

NASA says that it is important that all levels of the project agree 
to the coding standards, and that they are enforced. (NASA 
2004 pg. 146) 

Enforcing coding style involves the use of static checking in 
addition to formal peer reviews. Beyond the general consensus in 
the software community that following a defined coding style is a 
good idea, Nagappan and Ball found that “there exists a strong 
positive correlation between the static analysis defect density and 
the pre-release defect density determined by testing. Further, the 
predicted pre-release defect density and the actual pre-release 
defect density are strongly correlated at a high degree of 
statistical significance.” (Nagappan 2005, abstract) In other 
words, modules that fail to follow a coding style as determined by 
static analysis have more bugs. 

McConnell says: “Heed your compiler's warnings. Many modern 
compilers tell you when you have different numeric types in the 
same expression. Pay attention! Every programmer has been 
asked at one time or another to help someone track down a pesky 
error, only to find that the compiler had warned about the error 
all along. Top programmers fix their code to eliminate all 
compiler warnings. It's easier to let the compiler do the work 
than to do it yourself.” (McConnell 1993, pg. 237). 

MISRA Software Guidelines require the use of “automatic static 
analysis” for SIL 3 automotive systems and above, which tend to 
be systems that can kill or severely injure at least one person if 
they fail (MISRA Guidelines 1994, pg. 29). The guidelines also 
give this guidance: “3.5.2.6 Static analysis is effective in 
demonstrating that a program is well structured with respect to 
its control, data, and information flow. It can also assist in 
assessing its functional consistency with its specification.” IEC 
61508 highly recommends (which more or less means “requires” 
as an accepted practice) static analysis at SIL 2 and above (IEC 
61508-3 pg. 83). 

Finally, an automotive manufacturer has published data showing 
that they expect one “major bug” for every 30 coding rule 
violations (Kawana 2004): 


causes of malfunctions. More than a thousand malfunction 
cases in last three years have been analyzed, where error factors 
and causes are identified. It is statistically shown that 1 
serious-level bug is found with three minor bugs and 10 light-level 
coding-rule violations imply 1 serious-level bug in our current 
design processes (see in Fig. 2). A software with many 
coding-rule violations statistically is predicted to have a 
serious-level bug. By this estimation formula, code level 
improvement is justified to contribute final product quality 
assurance. 


[Fig. 2. Software Quality Index. Labels: Major Bugs, Minor Bugs, Rule Violation; value shown: 30.] 


(Figure from Kawana 2004) 

Note: As with my other posts in the last few months this was 
written regarding practices about a decade ago. There are newer 
sources for coding style information available now such as an 
updated version of MISRA C. There is also the ISO 26262 
standard, which is intended to replace the MISRA software 
guidelines. But we'll save discussing those for another time. 

Don’t Overflow the Stack 

Somewhere in your embedded system is the stack (or several of 
them for some multitasking systems). If you blow up the stack, 
your software will crash. Or worse, especially if you don’t have 
memory protection. For a critical system you need to make sure 
the stack has some elbow room and make sure that you know 
when you have a stack size problem. 

Also, you shouldn’t ever use recursion in a safety critical 
embedded system. (This shouldn’t even need to be said, but 
apparently it does.) 

Consequences: The consequences of not understanding 
maximum stack depth can be a seemingly random memory 
corruption as the stack overwrites RAM variables. Whether or 
not this actually causes a program crash depends upon a number 
of factors, including chance, and in the worst case it can cause 
unsafe program behavior without a crash. 

Accepted Practices: 

• Compute worst case stack depth using a static analysis tool. 

• Include a stack sentinel or a related technique in the 
supervisor task and perform a graceful shutdown or reset 
prior to an actual overflow. 

• Avoid all recursion to ensure that worst case stack depth is 
bounded. 


Discussion: 

The “stack” is an area in memory used for storing certain types of 
data. For example, in the C programming language this is where 
non-static local variables go. The stack gets bigger every time a 
subroutine is called, usually holding the subroutine return 
address, parameter values that have been passed, and local 
variables for each currently active subroutine. As nested 
subroutines are called, the stack keeps getting bigger as more 
information is added to the stack to perform each deeper and 
deeper call. When each subroutine is completed, the stack 
information is returned to the stack area for later use, un-nesting 
the layers of stack usage and shrinking the stack size. Thus the 
maximum size of the stack is determined by how many 
subroutines are called on top of each other (the depth of the 
subroutine call graph), as well as the storage space needed by 
each of those subroutines, plus additional area needed by any 
interrupt service routines that may be active at the same time. 
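
The arithmetic described above can be sketched in a few lines. This is only an illustration under stated assumptions, not how any particular analysis tool works: the four functions, their per-call stack costs, and the call table are all invented, and the table assumes every callee has a higher index than its caller (so there is no recursion).

```c
#include <stddef.h>

#define NUM_FUNCS 4u

/* Hypothetical per-call stack cost in bytes (return address, parameters,
 * local variables) for functions 0..3. */
static const unsigned int frame_bytes[NUM_FUNCS] = { 32u, 48u, 16u, 64u };

/* calls[i][j] != 0 means function i calls function j.
 * Assumption: callees always have a higher index than their callers. */
static const unsigned char calls[NUM_FUNCS][NUM_FUNCS] = {
    /* 0: main    */ { 0, 1, 1, 0 },
    /* 1: control */ { 0, 0, 0, 1 },
    /* 2: diag    */ { 0, 0, 0, 0 },
    /* 3: filter  */ { 0, 0, 0, 0 },
};

/* Worst case depth = deepest path through the call graph starting at
 * function 0, plus the stack needed by interrupt service routines that
 * may be active at the same time (isr_bytes). */
unsigned int worst_case_stack_bytes(unsigned int isr_bytes)
{
    unsigned int worst[NUM_FUNCS];

    /* Work from the last function upward so each callee is already done. */
    for (size_t n = NUM_FUNCS; n > 0u; n--) {
        size_t i = n - 1u;
        unsigned int deepest_callee = 0u;
        for (size_t j = i + 1u; j < NUM_FUNCS; j++) {
            if ((calls[i][j] != 0u) && (worst[j] > deepest_callee)) {
                deepest_callee = worst[j];
            }
        }
        worst[i] = frame_bytes[i] + deepest_callee;
    }
    return worst[0] + isr_bytes;
}
```

For this made-up table the deepest path is main, control, filter: 32 + 48 + 64 = 144 bytes. Adding a hypothetical 40 bytes for interrupt service routines gives 184 bytes.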


Some processors have a separate hardware stack for subroutine 
return information and a different stack for parameters and 
variables. And many operating systems have multiple stacks in 
support of multiple tasks. But for the most part similar ideas 
apply to all embedded controllers, and we’ll just discuss the 
single-stack case to keep things simple. 


[Figure: memory space, showing the current stack depth.] 


It is common for stacks to grow “downward” from high memory 
locations to low locations, with global variables, Real Time 
Operating System (RTOS) task information, or other values such 
as the C heap being allocated from the low memory locations to 
high memory locations. There is an unused memory space in 
between the stack and the globals. In other words, the stack 
shares limited RAM space with other variables. Because RAM 
space is limited, there is the possibility of the stack growing so 
large that it overlaps variable storage memory addresses so that 
the stack corrupts other memory (and/or loads and stores to 
global memory corrupt the stack). To avoid this, it is accepted 
practice to determine the worst case maximum stack depth and 
ensure such overlap is impossible. 

Maximum stack depth can be determined by a number of means. 
One way is to periodically sample the stack pointer while the 
program is running and find the maximum observed stack size. 
This approach is a starting point, but one should not expect that 
sampling will happen to catch the absolute maximum stack 
depth. Catching the worst case can be quite difficult because 
interrupt service routines also use the stack and run at times that 
are generally unpredictable compared to program execution flow. 
(Moreover, if you use a timer interrupt to sample the stack 
pointer value, you’ll never see the stack pointer when the timer 
interrupt is masked, leaving a blind spot in this technique.) So 
this is not how you should determine worst case stack depth. 

A much better approach is to use static analysis tools specifically 
designed to find the worst case set of subroutine calls in terms of 
stack depth. This technique is effective unless the code has 
structures that confound the analysis (e.g., if a program uses 
recursion, this technique generally doesn’t work). When the 
technique does work, it has the virtue of giving an absolute 
bound without having to rely upon whether testing happened to 
have encountered the worst case. You should use this technique if 
at all possible. If static analysis isn’t possible because of how you 
have written your code, you should change your code so you 
actually can do static analysis for maximum stack depth. In 
particular, this means you should never use recursion in a 
critical system, because it makes static analysis of stack 
depth impossible. Moreover, recursive routines are prone to 
stack overflows even if they are bug free, but happen to be fed an 
exceptional input value that causes too many levels of recursion. 
(In general, using recursion in any small-microcontroller 
embedded system is a bad idea for these reasons even if the 
system is not safety critical.) 

[Figure: Keep putting copies of data onto the stack, causing overflow.] 


Beyond static analysis of stack depth (or if static analysis isn’t 
possible and the system isn’t safety critical), an additional 
accepted practice is to use a “stack sentinel” to find the “high 
water mark” of the stack (I’ve also heard this called a “stack 
watermark” and a “stack guard”). With this technique each 
memory location in the stack area of memory is initialized to a 
predetermined known value or pattern of known values. As the 
stack grows in size, it will overwrite these known sentinel values 
with other values, leaving new patterns in memory that 
remain behind like footprints in fresh snow. Thus, it is easy to 
tell how big the stack has ever been since the system was started 
by looking for how far into the unused memory space the 
“footprints” of stack usage trample past the current stack size. 
This both gives the maximum stack size during a particular 
program run, and the overall system maximum stack size 
assuming that the worst case path has been executed. (Actually 
identifying the worst case path need not be done - it is sufficient 
to have run it at some time during program execution without 
knowing exactly when it happened. It will leave its mark on the 
stack memory contents when it does happen.) 
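
Here is a minimal sketch of the high water mark idea. It uses an ordinary array as a stand-in for the real stack region, assumes the stack grows downward, and simulates stack use with a helper function. The fill pattern, the sizes, and all of the names are assumptions made for this example; on a real target the region boundaries would come from the linker map.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS   64u
#define SENTINEL_WORD 0xA5A5A5A5u   /* hypothetical fill pattern */

/* Stand-in for the real stack region. Index STACK_WORDS-1 is the stack base,
 * and the stack is assumed to grow downward toward index 0. */
static uint32_t stack_area[STACK_WORDS];

/* Fill the whole region with the sentinel pattern before the stack is used. */
void stack_paint(void)
{
    for (size_t i = 0u; i < STACK_WORDS; i++) {
        stack_area[i] = SENTINEL_WORD;
    }
}

/* Simulate the stack having grown to 'depth_words' by overwriting that many
 * words from the base downward, as real pushes would. */
void stack_simulate_use(size_t depth_words)
{
    for (size_t k = 0u; (k < depth_words) && (k < STACK_WORDS); k++) {
        stack_area[STACK_WORDS - 1u - k] = (uint32_t)k;
    }
}

/* High water mark: count the words at the far end that still hold the
 * sentinel. Everything past them has been used at some point. */
size_t stack_high_water_words(void)
{
    size_t untouched = 0u;
    while ((untouched < STACK_WORDS) &&
           (stack_area[untouched] == SENTINEL_WORD)) {
        untouched++;
    }
    return STACK_WORDS - untouched;
}
```

After painting, simulating a depth of 10 words and then a shallower depth of 4 words still reports a high water mark of 10, because the deeper footprints stay in memory. One limitation of this sketch: if the deepest stack data happens to equal the sentinel value, the mark can be under-reported by a word or so.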

To convert this idea into a run-time protection technique, extra 
memory space between the stack and other data is set up as a 
sacrificial memory area that is there to detect unexpected 
corruptions before the stack can corrupt the globals or other 
RAM data. The program is run, and then memory contents are 
examined periodically to see how many stack area memory words 
are left unchanged. As part of the design validation process, 
designers compare the observed worst case stack depth against 
the static analysis computed worst case stack depth to ensure 
that they understand their system, have tested the worst case 
paths, and have left adequate margin in stack memory to prevent 
stack overflows if they missed some special situation that is even 
worse than the predicted worst case. 

This sentinel technique should also be used in testing and in 
production systems to periodically check the sentinels in the 
sacrificial memory area. If you have computed a worst case stack 
depth at design time and you detect that the computed depth has 
been exceeded at run time, this is an indication that you have a 
very serious problem. 
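
A periodic check of the sacrificial area might look something like the following sketch. The array stands in for memory that would really sit between the stack limit and the globals, and the pattern value, the area size, and the function names are all hypothetical. The overrun helper exists only so the check can be exercised on a desktop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_WORDS   8u
#define GUARD_PATTERN 0x5AA55AA5u   /* hypothetical sentinel value */

/* Stand-in for the sacrificial area between the stack and other RAM data. */
static volatile uint32_t guard_area[GUARD_WORDS];

/* Write the sentinel pattern into the whole sacrificial area at startup. */
void guard_init(void)
{
    for (size_t i = 0u; i < GUARD_WORDS; i++) {
        guard_area[i] = GUARD_PATTERN;
    }
}

/* Intended to be called periodically: returns true only if every sentinel
 * word is still intact. */
bool guard_intact(void)
{
    for (size_t i = 0u; i < GUARD_WORDS; i++) {
        if (guard_area[i] != GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}

/* Test hook standing in for a stack that has grown into the guard area. */
void guard_simulate_overrun(size_t word_index, uint32_t value)
{
    if (word_index < GUARD_WORDS) {
        guard_area[word_index] = value;
    }
}
```

If guard_intact() ever returns false in a running system, the stack has gone deeper than the design said it could, and the supervisor should perform a graceful shutdown or reset rather than carry on.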

If the stack overflows past the sacrificial memory space into 
global variables, the system might crash. Or that might just 
corrupt global variables or RTOS system state and have who- 
knows-what effect. (In a safety critical system “who-knows-what” 
equals “unsafe.”) Naively assuming that you will always get a 
system crash detectable by the watchdog on a stack overflow is a 
dangerous practice. Sometimes the watchdog will catch a stack 
overflow. But there is no guarantee this will always happen. 
Consider that a stack-smashing attack is the security version of 
an intentional stack overflow, but is specifically designed to take 
over a system, not just merely crash it. So a crash after a stack 
overflow is by no means a sure thing. 

Avoiding stack overflow problems is a matter of considering 
program execution paths to avoid a deep sequence of calls, and 
accounting for interrupts adding even more to the stack. And, 
again, even though we shouldn’t have to say this - never using 
recursion. 

While there isn’t a set number for how much margin to leave in 
terms of extra memory past the computed maximum stack size, 
there are two considerations. First, you want to leave enough 
room to have an ample sacrificial area so that a problem with 
stack depth is unlikely to have enough time to go all the way 
through the sacrificial area and touch the globals before it is 
detected by a periodic timer tick checking sentinels. (Note: we 
didn’t say check current stack depth; we said periodically check 
sentinels to see if the stack has gotten too big between checks.) 

Also, you want to leave some extra margin to account for the 
possibility you just never encountered the worst-case stack depth 
in analysis or testing. I’ve heard designers say worst case stack 
depth of 50% of available stack memory is a good idea. Above 
90% use of stack memory (10% sacrificial memory area set aside) 
is probably a bad idea. But the actual number will depend on the 
details of your system. 

Selected Sources: 

Stack sentinels and avoidance of recursion are an entrenched 
part of embedded systems folklore and accepted practices. 

Douglass mentions watermarking in Section 9.8.6 (Douglass 
2002). 

MISRA Software Guidelines discourage recursion, saying that it 
can cause unpredictable behavior (MISRA Guidelines, pg. 20). 

MISRA C required rule 70 explicitly bans recursion: 

Rule 70 (required): Functions shall not call themselves, either 1 


This means that recursive function calls cannot be used in safety-related 

with it the danger of exceeding available stack space, which can be a seri 
is very tightly controlled, it is not possible to determine before executior 
usage could be. 

NASA recommends using stack guards (essentially the same as 
the technique that I call “stack sentinels”) to check for stack 
overflow or corruption (NASA 2004, p. 93). 

Stack overflow errors are well known to corrupt memory and 
cause arbitrarily bad behavior of a running program. Regehr et 
al. provide an overview of research in that area relevant to 
embedded microcontrollers and ways to mitigate those problems 
(Regehr 2006). They note that “the potential for stack overflow in 
embedded systems is hard to detect by testing.” (id., p. 776), with 
the point being that it is reasonable to expect that a system which 
has been tested but not thoroughly analyzed will have run-time 
stack overflows that corrupt memory. 

Avoid Concurrency Defects 

Accesses to variables shared among multiple threads of execution 
must be protected via disabling interrupts, using a mutex, or 
some other rigorously applied concurrency management 
approach. 

Consequences: 

Incorrect or lax use of concurrency techniques can be expected to 
lead to concurrency bugs. Such bugs are usually difficult to detect 
via testing and difficult to reproduce once detected. A fraction of 
any such bugs can be reasonably expected to make it past even 
extensive testing into production fleets. 

Accepted Practices: 

• Aggressively minimize the use of globally shared variables. 
Every variable shared between tasks is a chance for a 
concurrency defect. 

• For every access to a shared variable, treat the entire time 
that a copy of the variable is “live” in the computation as a 
critical section, protecting access via masking interrupts or 
some other well defined technique. 

• Avoid concurrency management techniques that are “home 
brewed” or otherwise not part of well proven practice. 
Similarly, do not modify well known techniques to optimize 
efficiency or obtain other perceived benefits. Such 
techniques are extremely difficult to get right, and altering 
techniques or using non-standard techniques can be 
reasonably expected to introduce defects. 

• Declare every shared variable “volatile” to ensure that reads 
and writes do not result in stale data being used due to 
compiler optimization attempts to improve computation 
speed. 

• Keep critical sections as short as possible to minimize 
negative effects on scheduling. (The largest critical section 
forms a minimum length on blocking time for scheduling 
purposes). 

Discussion: 


One aspect of a modern real time embedded system is that it 
must appear to do many things at once. For example, an engine 
controller must look at many inputs, set throttle angle and 
perform diagnostic checks all at the same time. Typically many of 
these tasks are written as relatively independent pieces of 
software that must work together, and they must all appear to 
run at the same time. 


In reality there is only one CPU, so tasks must take turns using 
that CPU, with that turn taking supervised by the RTOS. And, 
tasks often must share things such as memory locations and 
input/output devices. The turn-taking means that, unless 
software designers are extremely careful, on occasion shared 
resources can be left in undefined or incorrect states, resulting in 
concurrency bugs. 

[Figure: timelines for Task 1, shared data, and Task 2 showing 
read, edit, and write steps, with correct and incorrect behavior 
marked. In the correct case, Task 2 gets the updated value from 
Task 1.] 

Example concurrency defect. 

Source: http://blogs.windriver.com/engblom/2010/06/true-concurrency-is-truly-different-again.html 
(That blog post has a nice discussion of how situation-dependent 
such defects are.) 
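
To make the read/edit/write problem concrete, here is a small
sketch in C. It is my own illustration, not code from the blog
post cited above. There is no real scheduler involved: each
task's access is split into explicit steps so that the
interleaving can be chosen by hand. All names (shared_count,
task_t, run_interleaved_order, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variable shared by two tasks. */
static volatile uint32_t shared_count = 0u;

/* Each task keeps a private copy while it works on the value. */
typedef struct {
    uint32_t local_copy;
} task_t;

static void task_read(task_t *t)        { t->local_copy = shared_count; }
static void task_edit(task_t *t)        { t->local_copy += 1u; }
static void task_write(const task_t *t) { shared_count = t->local_copy; }

/* Task 1 finishes before Task 2 starts: both increments survive. */
uint32_t run_correct_order(void)
{
    task_t t1, t2;
    shared_count = 0u;
    task_read(&t1); task_edit(&t1); task_write(&t1);
    task_read(&t2); task_edit(&t2); task_write(&t2);
    return shared_count;
}

/* Task 2 reads while Task 1's copy is still live, so Task 2 works
   from a stale value and its write wipes out Task 1's update. */
uint32_t run_interleaved_order(void)
{
    task_t t1, t2;
    shared_count = 0u;
    task_read(&t1);
    task_read(&t2);                    /* stale: Task 1 has not written */
    task_edit(&t1); task_write(&t1);
    task_edit(&t2); task_write(&t2);   /* overwrites Task 1's result */
    return shared_count;
}
```

With two increments requested, run_correct_order() leaves the
count at 2 while run_interleaved_order() leaves it at 1. Note
that declaring shared_count volatile does nothing to prevent the
lost update.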

Avoiding and fixing concurrency bugs is a major source of design 
and testing effort on most embedded systems. In part this is 
because concurrency bugs can be quite subtle, and in part it is 
because they can be very difficult to activate during testing as 
well as difficult to isolate even when one is observed (if they are 
observed at all during testing). The difficulty of detecting and 
fixing concurrency defects, as well as the reasonable probability 
that they won’t be seen at all in testing, makes disciplined use of 
good practices essential. 

A variety of techniques are available to avoid concurrency 
problems. A preferred approach is avoiding situations in which 
concurrency bugs are possible. For example, avoiding the use of 
shared global variables avoids associated concurrency defects 
(because the shared variables simply aren’t there to begin with). 
But, when sharing can’t be avoided, there are well defined basic 
techniques that work. Typically such techniques work by 
“locking” some resource so that other tasks cannot use it. As an 
analogy, consider a changing room at a clothing store. If you 
want to make sure that nobody else tries to use the room you are 
using, you need to “lock” the room when you enter (with an 
actual door lock, or maybe just by closing the door or curtain all 
the way), and then “unlock” the room when you leave. If you 
never lock the door it might be that nothing bad happens for 
dozens or hundreds of times you try on clothes. But eventually 
someone will wander into the room by mistake while you are 
there if you don’t lock the door. 

Disabling interrupts is a concurrency design approach that can 
be thought of as a program “locking” the CPU so that no other 
task can use it when a shared variable is being accessed. The way 
this works is that any task wanting to, say, increment a variable 
first disables interrupts, then increments the variable, then 
re-enables interrupts. The period of time between the first read of a 
variable and when the variable is done being updated is known as 
a “critical section,” and is the time during which no other task 
can be permitted to access the variable. Disabling interrupts 
turns off the hardware’s ability to switch tasks or perform 
anything but the desired computation during the critical section. 
This ensures that no other task in the system can read or write 
the shared variable, because disabling interrupts prevents any 
other task from running. It is essential that every single access to 
a shared variable disable interrupts for the entire use of the 
variable for this to be guaranteed to work. If a local copy of the 
variable is kept and used outside the time during which 
interrupts are disabled, there are no guarantees as to how the 
system will behave when that local copy is subsequently used to 
update the variable. Other techniques are available to manage 
concurrency beyond disabling interrupts. But, this is a common 
technique in embedded systems. 
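
The pattern described above might look roughly like the
following sketch in C. The interrupt control functions are
stand-ins so the example can run on a desktop machine. On a real
target they would be compiler intrinsics or RTOS calls, and the
names used here (disable_interrupts, enable_interrupts,
event_count) are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the target's interrupt control (hypothetical names). */
static bool interrupts_enabled = true;

static void disable_interrupts(void) { interrupts_enabled = false; }
static void enable_interrupts(void)  { interrupts_enabled = true; }
bool interrupts_are_enabled(void)    { return interrupts_enabled; }

/* Shared among tasks and interrupt handlers. */
static volatile uint32_t event_count = 0u;

/* The read, the edit, and the write all sit inside one critical
   section, so no local copy outlives the protection. */
void event_count_increment(void)
{
    assert(interrupts_enabled);      /* this sketch does not nest */
    disable_interrupts();
    uint32_t copy = event_count;     /* copy is live from here ... */
    copy += 1u;
    event_count = copy;              /* ... until it is written back */
    enable_interrupts();
}

/* A plain read is protected as well. */
uint32_t event_count_get(void)
{
    assert(interrupts_enabled);
    disable_interrupts();
    uint32_t copy = event_count;
    enable_interrupts();
    return copy;
}
```

This sketch re-enables interrupts unconditionally, which is only
correct if the functions are never called with interrupts already
disabled. Code that can nest critical sections typically saves
and restores the previous interrupt state instead.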

Even using these techniques, special care must be used in 
accessing any shared resource. For example, the keyword volatile 
must be used for every shared resource to ensure that the most 
up to date copy is always accessed. (But, even this won’t help if 
that copy is updated at an unexpected time.) 

Selected Sources: 


Ball describes concurrency defects in terms of being race 
conditions and prescribes disabling interrupts to solve the 
problem (Ball 2002, p. 162). 

Douglass gives a pattern for a critical section in section 7.2, and 
in section 7.2.6 says that the most common way to prevent 
context switching during a critical section is to disable interrupts. 
In section 7.2.5, Douglass says: “The designers and programmers 
must show good discipline in ensuring that every resource access 
locks the resource before performing any manipulation of the 
source.” (Douglass 2002) 

MISRA recommends that developers “Use Test-and-Set 
instructions or a signaling mechanism, such as 
Dekker/Dijkstra/Lamport Semaphores, to protect and mark as 
‘in-use’ any common resources.” (MISRA Report 3 p. 26) In more 
modern terminology, this is a recommendation to use a mutex or 
related semaphore-based “lock” on data. MISRA also cautions 
that interrupt enable and disable instructions must be used with 
care (id.). 
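
As one possible reading of the Test-and-Set recommendation, here
is a minimal sketch using the C11 atomic_flag type, whose
atomic_flag_test_and_set() operation is a standard test-and-set.
This is an illustration only, not MISRA's wording, and the
resource and function names (shared_setting, resource_try_lock,
and so on) are hypothetical. In a real project the RTOS-provided
mutex would normally be the proven mechanism to use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* "In use" marker for one shared resource. Clear means free. */
static atomic_flag resource_in_use = ATOMIC_FLAG_INIT;

/* Hypothetical shared resource guarded by the marker. */
static uint32_t shared_setting = 0u;

/* Try to claim the resource. atomic_flag_test_and_set() sets the
   flag and returns its previous value as one indivisible step, so
   exactly one caller sees "was clear" and wins. */
bool resource_try_lock(void)
{
    return !atomic_flag_test_and_set(&resource_in_use);
}

void resource_unlock(void)
{
    atomic_flag_clear(&resource_in_use);
}

/* Write only while holding the marker. Returns false, and changes
   nothing, if the resource is currently marked in use. */
bool shared_setting_update(uint32_t new_value)
{
    if (!resource_try_lock()) {
        return false;
    }
    shared_setting = new_value;
    resource_unlock();
    return true;
}

/* Reads take the marker too, so every access is protected. */
bool shared_setting_read(uint32_t *out)
{
    if (!resource_try_lock()) {
        return false;
    }
    *out = shared_setting;
    resource_unlock();
    return true;
}
```

The try-lock style pushes the "resource is busy" decision back to
the caller. What to do in that case (retry later, skip the
update, report a fault) is a design decision this sketch does not
attempt to answer.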

Sullivan presents results of a study of defect types, concluding 
that 11% of high impact memory corruption errors are due to 
concurrency defects. (Sullivan 1991, p. 6). This means that while 
most defects are easier to track down, a few race conditions and 
other concurrency defects can be expected to happen. 

Many of the non-overlay errors were concurrency-related. 
Common errors include deadlocks, sequence errors 
(programmer assumes that external events will always occur in 
a certain order), undefined state (programmer assumes an 
external event will never occur), and synchronization errors which 
together amount to almost 30 percent of the errors. 

(Sullivan 1991, p. 7) 

Concurrency defects are so difficult to find that specific testing 
and analysis tools have been developed to find them, prompting 
the creation of a benchmark suite to evaluate such tools (Jalbert 
2011). A significant challenge to creating such a benchmark is the 
difficulty in reproducing such bugs even when the exact bug is 
completely understood (id., first page). 

Park et al. provide a summary of work on finding and fixing 
concurrency defects (Park 2010). Among other things they note 
that a concurrency defect was responsible in part for the 2003 
Northeastern US electricity blackout that left 10 million people 
without power (id. p. 245), and that such bugs are difficult to 
reproduce (id.). 

Monday, June 16, 2014 
Ensure Code Is Modular 

Critical embedded software should be modular, and in particular 
should limit function size to no more than 1-2 pages of code 
including comments. 

Consequences: Poor modularity can reasonably be expected to 
result in code that is more complex, more brittle, and in general 
has a higher defect rate than modular code. This is in part 
because code with poor modularity is harder to understand 
(and therefore harder to get right), and in part because it is more 
difficult to test than code with good modularity. 

Accepted Practices: 


• A significant majority of functions, procedures, or 
subroutines must each be no more than a defined length in 
the range of 100-225 lines long, nominally 120 lines long, 
including comments. Longer functions should be rare, and 
their length should be specifically justified, taking into 
account whether they have low cyclomatic complexity or 
other mitigating factors. 

• Each module should have high cohesion, low coupling, and 
good information hiding. There are no generally accepted 
precise values for these metrics. 


Discussion: 


Code is considered modular if it is written in reasonably small 
chunks of code (modules) that work together in a way that 
maintains some key properties. Code that is merely in small 
chunks but that does not achieve these properties is not 
considered modular. A code “module” is classically defined as a 
piece of code that accomplishes a particular function. This may 
be either a single software subroutine (or equivalent) with 
associated variables, or may be a small set of collected 
subroutines that work together to accomplish a function or a set 
of closely related functions. Key properties of modular code 
include the following areas. 

Small size: each function should be one or two pages of code 
when printed (up to say 120 lines of code including comments 
maximum, but with most functions being less than one page of 
60-75 lines including comments). Having a function size larger 
than this makes it difficult for developers and reviewers to 
understand the entirety of the function at one time, decreasing 
the effectiveness of finding bugs. Large size also makes it harder 
to create effective unit tests due to the increased complexity of a 
large function. (This length limit is historically tied to the idea 
of a single function fitting on one printed page of about 65 
lines.) 

High cohesion: each module performs a single job and contains a 
set of closely related functions rather than a collection of 
unrelated functions. Having low cohesion results in code that 
mixes up unrelated functions, making it difficult to test, and 
difficult to maintain as changes need to be made. Low cohesion 
also normally leads to bugs due to related functions being 
scattered around many modules instead of in a single module. 

Low coupling: modules should have a low dependency upon 
other module data. High coupling causes data dependencies 
among modules that make testing difficult and tend to lead to 
subtle data sharing bugs. Good practice is to minimize coupling 
via avoiding global variables. High coupling can be reasonably 
expected to increase defect rates. 

Information hiding: modules should hide as much about the 
details of their implementation as possible to reduce implicit 
dependencies that can cause other parts of the software to fail. 
For example, the exact algorithm or data format used for 
calculations internal to a module should not be discernible to any 
other module. Accepted practice is to use the “static” keyword on 
identifiers within a .c file to accomplish information hiding, and 
declare variables within a procedure, instead of globally, if they 
are only needed within that procedure. 
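
Here is a minimal sketch of that use of “static” in C. The
module, its names (temp_sensor_store_sample,
temp_sensor_get_degrees), and the scaling numbers are all made up
for illustration. Everything is shown in one listing, with
comments marking what would live in the .c file and what would be
declared in its header.

```c
#include <assert.h>
#include <stdint.h>

/* ---- contents of a hypothetical temp_sensor.c ---- */

/* File-scope "static": visible only inside this .c file. Other
   modules cannot name these identifiers, so the raw data format
   and the scaling stay hidden. */
static uint16_t raw_adc_counts = 0u;          /* internal format */
static const int32_t COUNTS_PER_DEGREE = 4;   /* assumed scaling */
static const int32_t OFFSET_COUNTS = 200;     /* assumed counts at 0 */

/* Internal helper, also hidden from other files. */
static int32_t counts_to_degrees(uint16_t counts)
{
    return ((int32_t)counts - OFFSET_COUNTS) / COUNTS_PER_DEGREE;
}

/* ---- public interface (would be declared in temp_sensor.h) ---- */

void temp_sensor_store_sample(uint16_t counts)
{
    raw_adc_counts = counts;
}

int32_t temp_sensor_get_degrees(void)
{
    /* Needed only in this function, so declared in this function. */
    int32_t degrees = counts_to_degrees(raw_adc_counts);
    return degrees;
}
```

Callers see only degrees. If the sensor scaling or the stored
format changes later, only this one file needs to change.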

Selected Sources: 


Fenton & Neil discuss the topic of software metrics (1999), and 
say that to really understand the problem you need to consider 
many factors at once rather than simple metrics. They give as an 
example a combination of problem complexity, use of IEC1508 
(draft version of IEC 61508), code complexity, and operational 
usage among others (Fenton Fig 3, pg. 684) as a way to ensure 
good code. In other words, it is clear that these types of factors 
matter, even if there is no single optimal answer. 

McConnell presents a survey of function length practices. A 
length limit of one to two printed pages is common in industry. 
McConnell says: “If you want to write routines longer than about 
200 lines, be careful. (A line is a non-comment nonblank line of 
source code.)” He also says: “you’re bound to run into an upper 
limit of understandability as you pass 200 lines of code” 
(McConnell 1993, pp. 93-94). These comments are primarily in 
the context of desktop applications rather than safety critical 
embedded control software. In my opinion a smaller limit than 
200 lines of non-comment code is considered acceptable practice 
for safety critical embedded systems. McConnell, chapter 5, talks 
more generally about modular programming in terms of 
information hiding, coupling, cohesion, and function length. 

If you want to write routines longer than about 200 lines, 
a noncomment, nonblank line of source code.) None 
reported decreased cost, decreased error rates, or both 
distinguished among sizes larger than 200 lines, and you' 
an upper limit of understandability as you pass 200 lines < 
the code for IBM’s OS/360 operating system and other sys 
prone routines were those that were larger than 500 lines 
lines, the error rate tended to be proportional to the size < 
1986a). Moreover, an empirical study of a 148,000-line 
routines with fewer than 143 source statements were 2.4 
to fix than larger routines (Selby and Basili 1990. 

(McConnell 1993 p. 


Selby & Basili found that software modules with high coupling 
and low cohesion had 7.0 times as many errors (per line of 
non-comment code) as modules with low coupling and high 
cohesion. (Selby et al. 1991, abstract, p. 146). 

Low coupling and high strength are desirable. 

• Routines with the lowest coupling/strength ratios had 7.0 
(= 4.14/0.59 from Fig. 7) times fewer errors per KNCSS 
than routines with the highest coupling/strength ratios and 
errors were 21.7 (= 9.15/0.42 from Fig. 7) times less costly 
to fix. 


(Selby 1991 p. 149) 


Withrow found that the optimum size for minimizing defects was 
225 lines of code per module, which is a bit larger than some 
guidelines (Withrow 1990), but still in the same general range 
being discussed. 

The NASA “Power of Ten” rules for writing safety critical code 
recommend functions of at most 60 lines of text (Holzmann 2006, 
rule 4). They also recommend declaring data objects at the 
smallest possible level of scope (id., rule 6). 

IEC 61508-3 highly recommends a module size limit (p. 93). 

MISRA states that modularity requires high cohesion and low 
coupling (MISRA Report 5, pp. 9-10). 

Monday, June 9, 2014 

Minimize Use of Global Variables 

Critical embedded software should use the minimum practicable 
variable scope for each variable, and should minimize use of 
global variables. As one of the chapters in my book says, Global 
Variables Are Evil. Over-use of globals can reasonably be 
expected to result in significantly increased defect rates and the 
presence of difficult-to-find software defects that are likely to 
render a system unsafe. 

Consequences: Using too many global variables increases 
software complexity and can be reasonably expected to increase 
the number of bugs, as well as make the code difficult to 
maintain properly. Defining variables as globals that could 
instead be defined as locals can be reasonably expected to 
significantly increase the risk of the data being improperly used 
globally, as well as to make it more difficult to track and analyze 
the few variables that should legitimately be global. In short, 
excessive use of globals leads to an increased number of software 
defects. 

Accepted Practices: 

• A significant minority of variables (or fewer) should be 
global. Ideally zero variables should be global. (Special 
globals such as mathematical constants and configuration 
information might be excluded from this metric.) The exact 
number varies with the system, but an expected range 
would be from less than 1% to perhaps 10% of all statically 
allocated variables (even this is probably too high), with an 
extremely strong preference for the lower side of that range. 
Exceeding this range can reasonably be expected to lead to 
an increase in software defects. 

• The need for each global or category of globals should be 
specifically justified as required for effective software 
construction. Speed increases are generally not sufficient 
justification, nor is limited memory space. 

• In any system with more than one task (including systems 
that just have a main task plus interrupts), every 
non-constant global should be declared volatile, and every 
access to a global should be protected by a form of 
concurrency management (e.g., disabling interrupts or 
using a mutex). Failing to do either can normally be 
expected to result in concurrency defects somewhere in 
your code. 

• Each variable should be declared locally if possible. 
Variables used by a collection of functions in the same C 
programming language module should be declared as top- 
level “static” within the corresponding .c file to limit 
visibility to functions declared within that .c file. Variables 
used by only one function should be declared within that 
function and thus not be visible to any other function. 

• Global variables must be declared consistently throughout 
the system, including having exactly the same type 
information (e.g., if a global is declared as “unsigned” in 
one place, it needs to be “unsigned” everywhere). 

Discussion: 


Global variables (often called “globals” for short) are 
programming language variables that are visible to the entirety of 
the program. In contrast, non-global variables (often called local 
variables, or locals for short), have a reduced scope that makes 
them visible only in the part of the program where they are used, 
and invisible to the rest of the program. Excessive use of global 
variables must be avoided in safety critical software due to the 
complexity introduced as well as the risk of concurrency hazards. 

One reason to avoid globals is that use of many globals can be 
reasonably expected to lead to high coupling among many 
disparate portions of a program. Any variable that is globally 
visible might be read or written from anywhere in the code, 
increasing program complexity and thus the chance for software 
defects. While analysis might be performed to determine where 
globals actually are read and written, doing so for a large number 
of globals is time consuming, and must be re-done any time any 
substantive change is made to the code. (For example, it would 
be expected that such analysis would have to be re-done for each 
software release.) 

Another reason to avoid globals is that they introduce 
concurrency hazards (see the discussion on concurrency in an 
upcoming post). Because it is difficult to keep track of what parts 
of the program are reading and writing a global, safe code must 
assume that other tasks can access the global and use full 
concurrency protection each time a global is referenced. This 
includes both locking access to the global when making a change 
and declaring the global “volatile” to ensure any changes 
propagate throughout the software. 

Even if concurrency hazards are generally avoided, if more than 
one place in the program modifies a global it is easy to have 
unexpected software behavior due to two portions of the program 
modifying the global’s value in an unanticipated sequence. This 
can be reasonably expected to lead to infrequent and subtle (but 
potentially severe) concurrency defects. 

As an analogy, consider if you keep your grocery shopping list on 
a whiteboard on your fridge. If you live alone and are careful, this 
may work out fine, and you will never accidentally have an item 
erased or added in error to your list. But if you live in a busy 
house (perhaps a dormitory), something you wrote on that 
whiteboard might be changed or accidentally erased, and you 
might have no idea that it happened or who did it. In this 
analogy, everything you write on that whiteboard is a global 
variable - anyone who enters the house can see it, act upon the 
information, and potentially modify it. As an extension of this 
analogy, if that whiteboard is a “to do” list, you are using it to 
communicate information to others in the house. If that list is 
modified, corrupted, or overwritten, your communication will 
fail, with that sort of problem becoming more likely as more and 
more people share the whiteboard for a diversity of purposes. 

Using too many globals can be thought of as the data flow version 
of spaghetti code. With code “control” flow (conditional “if” 
statements and the like) that is tangled, it is difficult to follow the 
flow of control of the software. Similarly, with too many globals it 
is difficult to follow the flow of data through the program - you 
get “spaghetti data.” In both cases (tangled data flow and tangled 
control flow) designers can reasonably expect that the 
spaghetti software will have elevated levels of software defects. 
Excessive use of global variables makes unit testing difficult, 
because it requires identifying and setting specific values in all 
the globals referenced by a module that is being unit tested. 

In some cases the risk of a global may seem only theoretical 
rather than practical, if a variable is only used in a single module 
but happens to be defined as a global. However, this still presents 
a latent risk of some other piece of the software modifying the 
variable intentionally (defeating the principle of information 
hiding), or by accident via a typographical error that was 
intended to refer to a similarly named variable defined 
elsewhere. Therefore, it is an important accepted practice to only 
define variables as global if it is essential to do so. 

There are two related classes of globals that are an exception, and 
are generally considered acceptable for use. One class is 
constants that represent physical properties (e.g., the number of 
degrees in a circle being 360), system configuration information 
(e.g., the fact that a vehicle has a 6 cylinder engine or a 4-speed 
transmission), and other values that don’t change at run time. 
These are properly defined as constant values using the 
programming language keyword “const” in the code or other 
similar approach. 
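
A minimal sketch of such constant globals in C follows. The
constant names and the helper function circle_fraction_degrees
are hypothetical, and the values simply mirror the examples in
the paragraph above.

```c
#include <assert.h>
#include <stdint.h>

/* Physical property: never changes. */
const uint16_t DEGREES_PER_CIRCLE = 360u;

/* Configuration of a hypothetical vehicle: fixed at build time. */
const uint8_t NUM_CYLINDERS = 6u;
const uint8_t NUM_GEARS = 4u;

/* Any code may read the constants freely, because nothing can
   write them at run time. An assignment such as
       NUM_CYLINDERS = 8u;
   is rejected by the compiler. */
uint16_t circle_fraction_degrees(uint8_t parts)
{
    assert(parts != 0u);
    return (uint16_t)(DEGREES_PER_CIRCLE / parts);
}
```

Because these values are read-only, they do not carry the
concurrency hazards that writable globals do.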

Another class of generally acceptable globals is system state 
information that must be generally accessible across most 
software, such as engine speed in a vehicle. These system state 
variables should be written in exactly one place in the code, 
should not be duplicated (to avoid confusion and the chance for 
different copies of the variable to get out of synch), should be 
kept few in number, and ideally should be read via access 
functions rather than directly to enforce modularity. These are 
the sorts of globals that should be kept to a minimum and count 
toward the statement that having a few well-chosen globals might 
be OK. In distributed embedded systems these globals usually 
show up as broadcast bus values rather than as actual memory- 
based global variables. 
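
One way to arrange such a system state variable is sketched
below in C. The variable is file-scope static, is written in
exactly one function, and is read only through an access
function. The interrupt control functions are desktop stand-ins,
and all names (engine_speed_update, engine_speed_get, and so on)
are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the target's interrupt control (hypothetical names). */
static bool interrupts_enabled = true;
static void disable_interrupts(void) { interrupts_enabled = false; }
static void enable_interrupts(void)  { interrupts_enabled = true; }

/* The one and only copy of the state. "static" keeps other files
   from touching it directly; "volatile" because it is shared. */
static volatile uint16_t engine_speed_rpm = 0u;

/* The single place in the program that writes the value. It is
   meant to be called only by the task that measures engine speed. */
void engine_speed_update(uint16_t measured_rpm)
{
    assert(interrupts_enabled);      /* this sketch does not nest */
    disable_interrupts();
    engine_speed_rpm = measured_rpm;
    enable_interrupts();
}

/* Access function used by every reader. */
uint16_t engine_speed_get(void)
{
    assert(interrupts_enabled);
    disable_interrupts();
    uint16_t rpm = engine_speed_rpm;
    enable_interrupts();
    return rpm;
}
```

With this layout, finding every writer of the value is a matter
of searching for calls to one function rather than analyzing the
whole program.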

Entirely eliminating the use of non-constant globals is sometimes 
unachievable. But when complete elimination isn’t practical, 
globals should nonetheless be very few (handfuls, not thousands 
of them) and strategically selected. The globals that are used 
should be used carefully, and reviewed to ensure each is 
essential. 

Selected Sources: 


Wulf & Shaw in 1973 wrote a paper entitled: “Global variable 
considered harmful,” (Wulf 1973) describing it as the data 
version of a “goto” statement - which a previous paper by 
Dijkstra had famously declared “harmful.” After that publication, 
the concept of avoiding globals has mostly been a matter of 
clarifying the gray areas and educating new programmers. 


1973 February 


GLOBAL VARIABLE 

W. Wulf, Mary Shaw 
Carnegie-Mellon University 


The problems of indiscriminant access and vulnerability 

former reflects the fact that the declaror has no control over 
the latter reflects the fact that the program itself has no contr 
it operates on. Both problems force upon the programmer t 
global knowledge of the program which is not consistent with h 

McConnell says: “Global data is like a love letter between 
routines - it might go where you want it to go, or it might get lost 
in the mail.” (McConnell 1993, pg. 88). McConnell also says: 
“global-data coupling is undesirable because the connection 
between routines is neither intimate nor visible. The connection 
is so easy to miss that you could refer to it as information hiding's 
evil cousin - ‘information losing.’” (McConnell 1993, pg. 90). 

Global-data coupling. Two routines are global-data-cou; 
of the same global data. This is also called “common 
coupling.” If use of the data is read-only, the practice is 
however, global-data coupling is undesirable because 
tween routines is neither intimate nor visible. The com 
miss that you could refer to it as information hiding’s ev 
tion losing.” (McConnell 1993, p. 90; book on acc 

Ganssle advises “Minimize global variables, both to reduce 
interaction between routines and to reduce the number of places 
where interrupts will cause problems.” (Ganssle 1992, pg. 186). A 
few years later, Ganssle and designers in general appreciated that 
global variables were even more of a problem than previously 
thought. He then wrote, resorting to humor to make his point: 

“One of the greatest evils in the universe, an evil in part 
responsible for global warming, ozone depletion, and male 
pattern baldness, is the use of global variables.” Humor aside, he 
goes on to confirm the problems with globals, and gives 
recommendations. “Every firmware standard - backed up by the 
rigorous checks of code inspections - must set rules about global 
use.” He finishes by saying: “I feel that defining a global is such a 
source of problems that the team leader should approve every 
one.” (Ganssle 2000, p. 38) 

NASA recommends avoiding too many inter-component 
dependencies: “components should not depend on each other in 
a complex way,” which includes as a significant factor the use of 
globals to communicate data between components (NASA 2004, 
p. 93). 


MISRA Software Guidelines recommend against using global 
variables (MISRA Report 5, p. 10). IEC 61508-3 highly 
recommends a modular approach (p. 79), information 
hiding/encapsulation (id., p. 93), and a fully defined interface 
(id., p. 93), which sum up to recommending against the use of 
global variables. 

Update: you can find an example of getting rid of a global 
variable here: 

http://betterembsw.blogspot.com/2013/09/getting-rid-of-global-variables.html 